Scientific workflow systems: Pipeline Pilot and KNIME
Abstract
There are many examples of scientific workflow systems [1, 2]; in this short article I will concentrate only on cheminformatics applications and the workflow tools most commonly used in cheminformatics, namely Pipeline Pilot [3] and KNIME [4]. Workflow solutions have been used for years in bioinformatics and other sciences, and some also have applications in so-called “business intelligence” and “predictive analytics”. Readers can find details of Discovery Net, Galaxy, Kepler, Triana, SOMA, SMILA, VisTrails, and others on the Web. Kappler has compared Competitive Workflow, Taverna and Pipeline Pilot [5]. Taverna has been widely used in bioinformatics but is also used with the Chemistry Development Kit (CDK) [6, 7]. CDK-Taverna workflows are made freely available at myExperiment.org [8]. (myExperiment.org also includes KNIME workflows.) Discovery Net was one of the earliest examples of a scientific workflow system; its concepts were later commercialized in the InforSense Knowledge Discovery Environment (KDE). My 2007 review [1] centered on Pipeline Pilot and InforSense KDE; KNIME was then a relative newcomer. In 2009 the loss-making InforSense organization was acquired by IDBS, and KDE has since made progress in translational medicine [9]. InforSense’s ChemSense [10] used ChemAxon’s JChem Cartridge, and ChemAxon chemical structure, property prediction, and enumeration tools. ChemSense’s three major pharmaceutical customers have turned to other solutions. The InforSense Suite lives on, but it is not seen as a “personal productivity tool”; rather, it is integrated into the IDBS ELN platform. KNIME and Pipeline Pilot are now the market leaders in personal productivity in cheminformatics.